Patterns of Importance Variation in Spoken Dialog

Authors

  • Nigel Ward
  • Karen Richart-Ruiz
Abstract

Some things people say are more important, and some less so. The ability to automatically judge this, even approximately, would be a useful front end for many applications. This paper empirically examines importance as it varies from moment to moment in spoken dialog. Contextual prosodic features are informative, and importance is frequently associated with specific patterns of interaction that involve both participants and stretch over several seconds. A simple linear regression model gave importance estimates that correlated well, 0.83, with human judgments.

1 Importance in Language and Dialog

Not everything people say to each other is equally important: a dialog may include many ums and uhs with almost no significance, but also content words or nuances that are critical in one way or another. Many language processing applications rely on the ability to detect what is important in the input stream, including not only dialog systems, but also systems for summarization, information retrieval, information extraction, and so on. Today this is primarily done using task-specific heuristics, such as discarding stopwords, giving more weight to low frequency words, or favoring utterances with high average pitch. In this paper, however, we attempt to develop a general, task-independent notion of importance.

Our specific target application is voice codecs, as used in telecommunications. Today’s codecs treat all speech as equally valuable, transmitting it at the same quality level, and, to a first approximation, devoting the same number of bits to every frame. Instead we would like to transmit voice with more accuracy to the extent that it is more important, and less to the extent that it matters less, frame by frame. This should enable higher perceived call quality with no increase in average datarate.
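The idea of spending more bits on more important frames while holding the average datarate fixed can be sketched as follows. This is only an illustration under stated assumptions, not the authors' codec: the function name `allocate_bits`, the per-frame floor `min_bits`, and the proportional allocation rule are all hypothetical.

```python
def allocate_bits(importance, avg_bits_per_frame, min_bits=0):
    """Split a fixed bit budget across frames in proportion to importance.

    importance: one non-negative importance estimate per frame (hypothetical input).
    avg_bits_per_frame: the target average rate, which the allocation preserves.
    min_bits: assumed floor so that no frame is dropped entirely.
    """
    n = len(importance)
    budget = avg_bits_per_frame * n
    spare = budget - min_bits * n
    if spare < 0:
        raise ValueError("min_bits exceeds the average bit budget")
    total = sum(importance)
    if total == 0:
        # No frame stands out, so fall back to uniform allocation.
        return [float(avg_bits_per_frame)] * n
    # Every frame gets the floor; the remainder is shared by importance.
    return [min_bits + spare * w / total for w in importance]


bits = allocate_bits([1, 4, 5, 2], avg_bits_per_frame=120, min_bits=40)
```

The allocations are fractional; a real codec would have to round them to the rates it actually supports, which this sketch does not attempt.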
As a straw in the wind, we generated two alternative encodings of a 4-second dialog segment, one with the lower quality used for the less important parts, and one with the lower quality used for the more important parts, and found in informal presentations to colleagues that the former was strongly preferred.

To actually build an importance-adaptive codec we need to understand how importance varies in dialog, and develop models that predict in real time how important the next frame will be. In this paper we approach this problem from a dialog perspective, looking for patterns in how participants in a dialog signal to each other what is important and unimportant.

Section 2 explains our empirical approach, based on 100 minutes of spoken dialogs annotated for importance, mostly word-by-word. Sections 3 and 4 identify correlates of importance including individual prosodic features and longer prosodic patterns of interaction. Section 5 describes a predictive model which uses this information to produce importance estimates which correlate 0.83 with human judgments. Section 6 analyzes the shortcomings of this model, Section 7 discusses the prospects for realtime importance prediction, and Section 8 summarizes the significance and future work needed.

2 Annotating Importance

We have found no standard definition of importance useful for describing what happens, moment-by-moment, in spoken dialog. The closest contender would be entropy, as defined in information theory. For text we can measure the difficulty of guessing letters or words, as a measure of their unpredictability and thus informativeness (Shannon, 1951), but this is indirect, time-consuming, and impossible to apply to nonsymbolic aspects of language. We can also measure how certain information, if present, helps improve the accuracy of predictions, as a measure of its value, but again this is indirect and time-consuming (Ward and Walker, 2009).
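To make the entropy-based view concrete, the sketch below computes per-word surprisal, -log2 p(word), from unigram frequencies. This is a deliberately crude stand-in for Shannon's guessing procedure, not a method used in the paper; the function name and the toy token list are illustrative assumptions. It also shows the limitation noted above: the measure only sees word identities, not how the words were said.

```python
import math
from collections import Counter


def unigram_surprisal(tokens):
    """Return the surprisal in bits of each distinct token under a unigram model."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Toy transcript (hypothetical): frequent fillers carry few bits, rare words many.
tokens = ["uh", "uh", "uh", "uh", "yeah", "yeah", "tuesday", "budget"]
surprisal = unigram_surprisal(tokens)
```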
We therefore chose to do an empirical study, using a pretheoretic notion of importance, while leaving its formalization for future work.

We hired a student to annotate importance. Wanting to capture her naive judgments, atheoretically, we did not precisely define importance for her. Instead we discussed the concept briefly, noting that importance may be judged not just in terms of content but also in terms of value for directing the future course of the dialog, not just from the speaker’s perspective but also from the listener’s, and not just from the words said but also from how they were said.

While importance often seems to be a property of utterances, there are also often cases where it varies within utterances, word by word, or even within words. While in principle one could go so far as to examine importance at the level of phonemes or below, doing so would quickly lose connection with naive intuitions. So mostly word-level annotations seemed appropriate.

Our annotator used a labeling tool that let her navigate back and forth in the dialogs, listen to the speakers together in stereo or independently, delimit regions of whatever size she wanted, and ascribe to each region an importance value. While importance is continuous, for convenience we had her use discrete values: the whole numbers from 0 to 5, with 5 indicating highest importance, 4 typical importance, 3 somewhat less importance, 2 and 1 even less, and 0 silence.

Wanting to have a variety of speakers, topics, and speaking styles, we chose to have her annotate dialogs from the Switchboard corpus (Godfrey et al., 1992).

After she had done 10 minutes of dialog we checked over her work. To do this the second author independently labeled the same dialogs, and then examined the places where her labels differed from the annotator’s by more than one point. These disagreements were of four main types:

1. differences due to variation in the placement of boundaries between regions, which arose because the audio was not pre-segmented. One type of difference was variation in the marking of words’ exact endpoints. Another type involved within-word variation, for example during words stretched out while the speaker decided how to continue. In such cases the importance seemed to steadily decrease, but the use of discrete labels required the annotators to arbitrarily choose a timepoint at which to mark a drop.

2. differences arising because the annotator sometimes missed small quiet sounds, especially quiet backchannels that overlapped talk by the main speaker and pre-turn noisy inbreaths.

3. differences in the treatment of repetitions. The annotator tended to ascribe the same importance to both renditions of repeated words, for example in cases of false starts. Logically, one or the other is redundant and thus less important. The second author tended to consider the second rendition more important, as it was generally more fluent and clear, but one could also argue for the importance of the first rendition, from turn-taking considerations, for example.

4. differences in the ratings of backchannels. Although low in content, these are known to be important for the flow of the dialog.

Given our current level of knowledge, and in particular the lack of any reason to consider our opinions more valid than hers, we chose not to change the labels or procedure. Instead we just sat down with the annotator, discussed the differences, and asked her to pay a little more attention to these aspects for the remainder of the labeling.

In total, she labeled both tracks of just over 100 minutes of dialog. As expected, there was diversity in labels, supporting our belief that importance is not monotone: the largest fraction of nonzero-labeled regions, covering cumulatively 38% of the total time, was at level 4, but there were also 20% at level 3 and 37% at level 5.
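Time shares of this kind can be computed directly from region annotations. The sketch below assumes each labeled region is a `(start_seconds, end_seconds, level)` tuple; that representation, the function name, and the toy data are assumptions for illustration, since the labeling tool's actual output format is not described.

```python
from collections import defaultdict


def time_share_by_level(regions):
    """Return the fraction of total labeled time spent at each importance level.

    regions: iterable of (start_s, end_s, level) tuples (assumed format).
    """
    durations = defaultdict(float)
    for start, end, level in regions:
        durations[level] += end - start
    total = sum(durations.values())
    if total == 0:
        return {}
    return {level: d / total for level, d in sorted(durations.items())}


# Toy annotation track (hypothetical values), with level 0 marking silence.
regions = [(0.0, 1.0, 0), (1.0, 3.0, 4), (3.0, 4.0, 5), (4.0, 5.0, 3)]
shares = time_share_by_level(regions)
```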
In general importance was variable: on average it stayed at the same level for only 1.5 seconds. It was also “lumpy” rather than smooth, with an average distance between two level-5 regions in the same track of 10.4 seconds. Figure 1 illustrates.

In parallel, the second author continued labeling until she had annotated 17 minutes of dialog. The correspondence between the two sets of labels ...

Figure 1: Importance versus Time, in milliseconds. Rectangular line: Annotator judgments; Jagged line: Predictions (discussed below). The words are all by one speaker, horizontally positioned by approximate occurrence.
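The two variability measures reported above, how long importance stays at one level and how far apart level-5 regions are, could be computed along the following lines. This is a sketch under assumptions: regions are `(start_s, end_s, level)` tuples sorted by time and contiguous within a track, and "distance" is taken as the time from the end of one level-5 region to the start of the next. The paper does not state its exact definitions, so both choices are hypothetical.

```python
def mean_dwell_time(regions):
    """Mean duration of maximal runs of consecutive regions sharing one level."""
    runs = []
    for start, end, level in regions:
        if runs and runs[-1][2] == level:
            # Same level as the previous region: extend the current run.
            runs[-1] = (runs[-1][0], end, level)
        else:
            runs.append((start, end, level))
    if not runs:
        return None
    return sum(end - start for start, end, _ in runs) / len(runs)


def mean_gap_between(regions, level=5):
    """Mean time from the end of one region at `level` to the start of the next."""
    hits = [(start, end) for start, end, lv in regions if lv == level]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(hits, hits[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Toy track (hypothetical values).
regions = [(0.0, 1.0, 4), (1.0, 2.0, 4), (2.0, 3.0, 5), (3.0, 6.0, 2), (6.0, 7.0, 5)]
dwell = mean_dwell_time(regions)
gap = mean_gap_between(regions, level=5)
```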





Publication date: 2013